
[Trainer] Fix distributed dataloader #8932

Merged: 4 commits merged into PaddlePaddle:develop on Aug 16, 2024

Conversation

@DesmonDay (Contributor) commented Aug 14, 2024

PR types

Bug fixes

PR changes

Others

Description

  1. Fix distributed dataloader.
  2. Fix rng state loading.
  3. Fix uc unittest.

Why the distributed dataloader hangs: the problem mainly affects warm-start (resume-from-checkpoint) runs with iterable datasets. In the original implementation, the data ranks receive the iterable dataset as input, so their sampler is of the Infinite type, while the non-data ranks receive None as input; when the input is None, Paddle's DataLoader automatically falls back to a batch sampler. After a warm start, the skip-data logic is normally executed, and it looks roughly as follows:
[Screenshot 2024-08-15 17:08:57: the skip-data logic]

As a result, the data ranks take the second branch while the non-data ranks take the first branch; the two groups follow inconsistent branch logic, so the job hangs. The stack traces captured at the time of the hang show the exact problem.
The hang on the data ranks is shown below:
[Screenshot 2024-08-15 17:06:42: data-rank stack trace]
The hang on the non-data ranks is shown below:
[Screenshot 2024-08-15 17:07:22: non-data-rank stack trace]
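To make the divergence concrete, here is a minimal sketch of the skip-data pattern described above. The function name, the flag argument, and the branch bodies are assumptions for illustration, not the Trainer's actual code.

```python
def skip_consumed_batches(dataloader, consumed_steps, uses_infinite_sampler):
    """Hypothetical sketch of the skip-data step described above (not the Trainer's code)."""
    if not uses_infinite_sampler:
        # Branch 1: non-data ranks. Their input was None, so Paddle's DataLoader
        # gave them a plain batch sampler; skipping only advances local indices
        # and performs no communication.
        for _ in range(consumed_steps):
            pass
    else:
        # Branch 2: data ranks, whose sampler is the Infinite type. Skipping here
        # consumes real batches, and with the distributed dataloader that involves
        # broadcasting data to the peer ranks, so it blocks forever while those
        # peers sit in branch 1.
        data_iter = iter(dataloader)
        for _ in range(consumed_steps):
            next(data_iter)
```

The fix synchronizes the iterable-dataset flag across ranks (see the `_is_iterable_dataset` all-reduce quoted near the end of this conversation), so every rank builds a loader with the same sampler type and takes the same branch.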

paddle-bot (bot) commented Aug 14, 2024

Thanks for your contribution!

codecov (bot) commented Aug 14, 2024

Codecov Report

Attention: Patch coverage is 32.69231% with 35 lines in your changes missing coverage. Please review.

Project coverage is 55.04%. Comparing base (75c7636) to head (2384c4d).
Report is 12 commits behind head on develop.

Current head 2384c4d differs from pull request most recent head fd9ffba

Please upload reports for the commit fd9ffba to get more accurate results.

Files Patch % Lines
paddlenlp/trainer/trainer.py 34.09% 29 Missing ⚠️
paddlenlp/data/dist_dataloader.py 25.00% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8932      +/-   ##
===========================================
- Coverage    55.05%   55.04%   -0.02%     
===========================================
  Files          635      635              
  Lines        99412    99449      +37     
===========================================
+ Hits         54730    54739       +9     
- Misses       44682    44710      +28     

☔ View full report in Codecov by Sentry.

train_dataset,
batch_size=self.args.per_device_train_batch_size,
collate_fn=self.data_collator,
num_workers=self.args.dataloader_num_workers,

Collaborator commented: Would the same problem also be triggered with the plain DataLoader?

Contributor Author (@DesmonDay) replied: No; see the explanation of the hang in the PR description.

batch_size=self.args.per_device_train_batch_size,
collate_fn=self.data_collator,
num_workers=self.args.dataloader_num_workers,
)

train_sampler = self._get_train_sampler()

Collaborator commented: The logic above is the is_iterable_dataset path, so the code below is the non-iterable-dataset logic?

@@ -1694,6 +1726,8 @@ def _load_rng_state(self, checkpoint):

if self.args.use_hybrid_parallel:
    if "hybrid_parallel_rng_state_tracker" in checkpoint_rng_state:
        if self.args.tensor_parallel_degree <= 1:
            checkpoint_rng_state["hybrid_parallel_rng_state_tracker"].pop("model_parallel_rng", None)

Collaborator commented: Please explain clearly, in words, what causes the hang here.

Collaborator commented: +1

Contributor Author (@DesmonDay) replied: This line does not trigger the hang; it is just a bug fix. If tensor parallelism is not enabled but the checkpoint's rng_state still contains the tensor-parallel seed, loading it raises an error.

Contributor Author (@DesmonDay) replied: Explained in the PR description.

@@ -1398,12 +1398,15 @@ def get_train_dataloader(self):
    raise ValueError("We don't need train_dataset when should_load_dataset is False.")

train_dataset = self.train_dataset
if self.args.distributed_dataloader:
    is_iterable_dataset = self._is_iterable_dataset_dd(train_dataset)

Collaborator suggested a change:
- is_iterable_dataset = self._is_iterable_dataset_dd(train_dataset)
+ is_iterable_dataset = self._is_iterable_dataset_distributed(train_dataset)

batch_size=self.args.per_device_train_batch_size,
collate_fn=self.data_collator,
num_workers=self.args.dataloader_num_workers,
is_iterable_dataset=True,

Collaborator commented: You could use an additional_args = {} dict and pass it via **additional_args, still keeping DistDataLoader and DataLoader merged into a single call site.

Contributor Author (@DesmonDay) replied: That does not really work, because Paddle's DataLoader does not accept a variable number of keyword arguments; it would require modifying Paddle itself.
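For context, the pattern being proposed would look roughly like the sketch below. The variable names are assumptions, and as the author's reply points out, it only helps if every loader tolerates the extra keyword; paddle.io.DataLoader has a fixed signature and raises a TypeError on unknown keyword arguments, short of changes in Paddle itself.

```python
# Hypothetical sketch of the reviewer's suggestion (not what the PR does):
# gather the DistDataLoader-only keyword arguments in a dict and splat them in,
# so DistDataLoader and DataLoader can share one construction site.
additional_args = {}
if self.args.distributed_dataloader:
    # Only DistDataLoader understands this flag; passing it to paddle.io.DataLoader
    # would raise a TypeError.
    additional_args["is_iterable_dataset"] = is_iterable_dataset

dataloader_cls = DistDataLoader if self.args.distributed_dataloader else DataLoader
train_dataloader = dataloader_cls(
    train_dataset,
    batch_size=self.args.per_device_train_batch_size,
    collate_fn=self.data_collator,
    num_workers=self.args.dataloader_num_workers,
    **additional_args,
)
```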

@@ -1694,6 +1726,8 @@ def _load_rng_state(self, checkpoint):

if self.args.use_hybrid_parallel:
    if "hybrid_parallel_rng_state_tracker" in checkpoint_rng_state:
        if self.args.tensor_parallel_degree <= 1:
            checkpoint_rng_state["hybrid_parallel_rng_state_tracker"].pop("model_parallel_rng", None)

Collaborator commented: +1

@@ -1132,24 +1132,6 @@ def rerun(self, train_args):
np.testing.assert_allclose(res[0], res[-1], rtol=self.rtol)


@pytest.mark.skipif(True, reason="Skip for None CE")

Collaborator commented: Why was this removed?

Contributor Author (@DesmonDay) replied: If an ignore_merge_optimizer option is added in the future, it would conflict with skip_save_model_weight, so this was removed.

@@ -33,6 +33,11 @@ def __len__(self):
return 0


class IterableDummyDataset(paddle.io.IterableDataset):

Collaborator commented: I am wondering whether the fake dataset could instead be constructed on the dataset side itself.

Contributor Author (@DesmonDay) replied: I do not quite follow what you mean; the current implementation seems fine to me?
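The hunk above only shows the class declaration. Judging from the neighboring DummyDataset (whose __len__ returns 0), this looks like a placeholder for ranks that do not load real data; the sketch below shows what such a placeholder could look like, with the body being an assumption rather than the PR's actual code.

```python
import paddle


class IterableDummyDataset(paddle.io.IterableDataset):
    """Placeholder iterable dataset for ranks that never load real data (sketch)."""

    def __iter__(self):
        # Yield nothing: these ranks only need an object whose type makes the
        # DataLoader pick the same (iterable/Infinite) sampler as the data ranks,
        # so all ranks follow the same skip-data branch after a warm start.
        return iter([])
```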

@wawltor (Collaborator) previously approved these changes on Aug 15, 2024:

LGTM

@DesmonDay force-pushed the fix_dist_dataloader branch from 2384c4d to fd9ffba on August 15, 2024 at 12:28
@ZHUI merged commit e8708ed into PaddlePaddle:develop on Aug 16, 2024 (9 of 12 checks passed)
@SylarTiaNII (Contributor) left a comment: This needs a fix.

# For distributed dataloader.
is_iterable_dataset_tensor = paddle.to_tensor(self._is_iterable_dataset(dataset)).reshape([1])
if dist.get_world_size() > 1:
    dist.all_reduce(is_iterable_dataset_tensor, op=dist.ReduceOp.MAX)

Contributor commented: NPU does not support bool-type collective communication, so this needs compatibility handling.
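One possible way to handle this, sketched below as an assumption rather than a committed fix: communicate the flag as an integer instead of a bool, so backends such as NPU that lack bool collectives can still participate. The helper name is hypothetical.

```python
import paddle
import paddle.distributed as dist


def sync_is_iterable_dataset(is_iterable: bool) -> bool:
    # Hypothetical workaround sketch: encode the bool as int32 before the
    # collective, since some backends cannot all_reduce bool tensors.
    flag = paddle.to_tensor([1 if is_iterable else 0], dtype="int32")
    if dist.get_world_size() > 1:
        # MAX makes the flag true on every rank if any rank saw an iterable dataset.
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())
```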

Mangodadada pushed a commit to Mangodadada/PaddleNLP that referenced this pull request Sep 10, 2024
* fix ddloader, fix uc unittest

* update dataloader